Task 1: Problem definition¶

Objectives for the Data Pipeline:¶

Objective 1: Identify Location and Room-Type with Highest Investment Return:
This involves analyzing historical resale price data to determine which locations and flat types (e.g., 4-room, 5-room) have shown the highest return on investment over recent years. It aims to provide insights for potential property investors or homeowners looking to maximize their returns.

Objective 2: Predict Price Growth for Future Using Machine Learning:
This objective involves building predictive models using machine learning algorithms on historical resale price data to forecast future price growth. By leveraging past trends and patterns, the goal is to provide stakeholders with insights into potential future price movements, aiding in decision-making processes related to property investment.

Measurement of Success:¶

The success of the data pipeline can be measured based on the following criteria:

  • Accuracy of Predictions: For Objective 1, success can be measured by the accuracy of identifying locations and room types with the highest investment return based on historical data analysis. For Objective 2, success can be measured by the accuracy of the machine learning models in predicting future price growth.
  • Return on Investment (ROI): Success can also be evaluated based on the ROI achieved by stakeholders who utilize the insights provided by the data pipeline. Higher ROI indicates the effectiveness of the insights generated.
  • User Feedback: Soliciting feedback from stakeholders who utilize the insights generated by the data pipeline can also be a valuable measure of success. Positive feedback indicating that the insights are actionable and useful would signify success.

Use Cases for the Data Pipeline:¶

The data pipeline will serve various use cases including:

  • Reporting: Generating reports on historical resale price trends, identifying locations and room types with the highest investment return, and providing summaries of machine learning predictions for future price growth.
  • Analytics: Performing in-depth analysis of historical resale price data to uncover trends, patterns, and correlations that can inform decision-making related to property investment.
  • Machine Learning: Developing and deploying machine learning models to predict future price growth based on historical data, enabling stakeholders to make informed decisions about property investment strategies.

Task 2: Data collection/curation¶

In [1]:
import re
import datetime
import warnings
import pandas as pd
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

from matplotlib import rcParams
import matplotlib.pyplot as plt
from pathlib import Path

warnings.filterwarnings('ignore')
# sns.set_theme()

# figure size in inches
rcParams['figure.figsize'] = 11.7,5.27

The primary dataset required for this analysis is the HDB resale price data, which contains information about the resale prices of HDB flats in different locations and with various room types over past years. This dataset can be downloaded from data.gov.sg, specifically from these links:

  1. Resale flat prices based on registration date from Jan-2017 onwards
  2. Resale Flat Prices (Based on Registration Date), From Jan 2015 to Dec 2016
  3. Resale Flat Prices (Based on Registration Date), From Mar 2012 to Dec 2014

The 3 csv files are concatenated together to form the starting dataset.

To work towards the objectives defined in Task 1, we intend to proceed with the following steps:

  1. Data Cleaning and Preparation: Preprocess the data to handle missing values, outliers, and inconsistencies. Ensure proper formatting of variables such as location, room type, and price.

  2. Exploratory Data Analysis (EDA):

    • Conduct EDA on the integrated dataset to understand the distributions, correlations, and patterns within the data.
    • Identify potential features that correlate with investment return and price growth, such as location, room type, economic indicators, and demographic factors.
  3. Model Development:

    • Identify Features: Select relevant features for modeling based on insights from EDA and domain knowledge.
    • Machine Learning Algorithms: Choose appropriate machine learning algorithms (e.g., regression, ensemble methods) for predicting price growth.
    • Model Training: Train the selected models on historical data, utilizing techniques such as cross-validation to ensure robustness.
  4. Model Evaluation:

    • Performance Metrics: Evaluate model performance using metrics appropriate to the task, e.g. RMSE (Root Mean Squared Error), MAE, and R² for regression, or accuracy, precision, and F1 score if price growth is framed as classification.
    • Validation: Validate models on holdout datasets or through time-series cross-validation to assess generalization performance.
  5. Interpretation and Insights:

    • Interpret model results to understand the factors driving investment return and price growth.
    • Extract actionable insights for stakeholders, such as identifying high-return locations and room types or predicting future price trends.
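The evaluation step above (time-series cross-validation with RMSE) can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn; the series and model are placeholders, not part of the actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic monthly price series standing in for the real features/target
rng = np.random.default_rng(0)
n = 120
X = np.arange(n, dtype=float).reshape(-1, 1)      # time index as the only feature
y = 300 + 2.5 * X.ravel() + rng.normal(0, 10, n)  # upward-trending prices plus noise

# Each split trains on the past and tests on the following window,
# so the model is never evaluated on data earlier than its training set
rmses = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmses.append(float(np.sqrt(np.mean((pred - y[test_idx]) ** 2))))

print([round(r, 1) for r in rmses])
```

Unlike ordinary k-fold cross-validation, this ordering-aware scheme avoids leaking future prices into training, which matters for forecasting tasks.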

Task 3: Data preparation¶

Initial Loading of data¶

For loading the data we will be using, we create a folder to put the curated input data:

In [2]:
if not Path("./input").exists():
    Path('./input').mkdir()

The 3 files from 2012 to present day (2024) are downloaded from:

  1. Resale flat prices based on registration date from Jan-2017 onwards
  2. Resale Flat Prices (Based on Registration Date), From Jan 2015 to Dec 2016
  3. Resale Flat Prices (Based on Registration Date), From Mar 2012 to Dec 2014

and placed into the relative directory ./input as csvs.
A snapshot of these files (downloaded 19th March) is provided in the accompanying folder; please re-download from the aforementioned sources if you need to rebuild the dataset on the latest data.

The following reads the files into memory and shows the types of data available:

In [3]:
files = Path('./input').glob('*.csv')
all_files = []
for file in files:
    df = pd.read_csv(file)
    print(f"Rows: {len(df)}, file: {file.name}")
    all_files.append(df)
dfs = pd.concat(all_files)
dfs = dfs.sort_values('month')
Rows: 37153, file: ResaleFlatPricesBasedonRegistrationDateFromJan2015toDec2016.csv
Rows: 175110, file: ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv
Rows: 52203, file: ResaleFlatPricesBasedonRegistrationDateFromMar2012toDec2014.csv
In [4]:
#plotting the heatmap
corr_matrix = dfs.corr(numeric_only=True)
plt.figure(figsize = (8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
[Figure: Correlation Heatmap]
In [5]:
# Taking a sample to show what the data looks like
dfs.sample(n=10)
Out[5]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date remaining_lease resale_price
5358 2015-04 YISHUN 4 ROOM 751 YISHUN ST 72 07 TO 09 91.0 New Generation 1984 68 428000.0
16267 2015-11 WOODLANDS 5 ROOM 557 WOODLANDS DR 53 01 TO 03 120.0 Improved 1999 83 377000.0
6540 2017-05 CHOA CHU KANG 4 ROOM 790 CHOA CHU KANG NTH 6 10 TO 12 110.0 Model A 1996 78 years 07 months 408000.0
83712 2020-11 JURONG EAST 4 ROOM 57 TEBAN GDNS RD 04 TO 06 92.0 Model A 2016 95 years 01 month 420000.0
109178 2021-10 SENGKANG 4 ROOM 217A COMPASSVALE DR 07 TO 09 92.0 Premium Apartment 2017 94 years 08 months 620000.0
38267 2018-10 SERANGOON EXECUTIVE 142 SERANGOON NTH AVE 1 01 TO 03 151.0 Maisonette 1986 67 years 02 months 680000.0
2464 2017-03 BISHAN 3 ROOM 406 SIN MING AVE 10 TO 12 64.0 Simplified 1987 69 years 02 months 332000.0
29528 2018-06 SEMBAWANG 4 ROOM 470 SEMBAWANG DR 07 TO 09 91.0 Model A 2000 81 years 04 months 315000.0
108609 2021-08 SENGKANG 4 ROOM 330B ANCHORVALE ST 07 TO 09 93.0 Model A 2015 93 years 485000.0
28176 2013-06 TAMPINES 5 ROOM 130 SIMEI ST 1 04 TO 06 122.0 Improved 1987 NaN 570000.0
In [6]:
dfs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 264466 entries, 1303 to 173931
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   month                264466 non-null  object 
 1   town                 264466 non-null  object 
 2   flat_type            264466 non-null  object 
 3   block                264466 non-null  object 
 4   street_name          264466 non-null  object 
 5   storey_range         264466 non-null  object 
 6   floor_area_sqm       264466 non-null  float64
 7   flat_model           264466 non-null  object 
 8   lease_commence_date  264466 non-null  int64  
 9   remaining_lease      212263 non-null  object 
 10  resale_price         264466 non-null  float64
dtypes: float64(2), int64(1), object(8)
memory usage: 24.2+ MB

Data cleaning and handling null values¶

From the above information, we can see that remaining_lease is stored as text in the format ## years ## months. We will need to clean this into a single sortable numeric value, since it is essentially continuous data: we can express it as year + month/12.
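As a worked illustration of this conversion (a standalone sketch; the pipeline's own parsing code appears later):

```python
def lease_str_to_years(text: str) -> float:
    """Convert '## years ## months' (months optional) to a sortable float."""
    parts = text.split()
    years = int(parts[0])
    months = int(parts[2]) if len(parts) >= 4 else 0
    return years + months / 12

print(lease_str_to_years("78 years 07 months"))  # ~78.583
print(lease_str_to_years("35 years"))            # 35.0
```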

For the NaN values (Not a Number, akin to None or Null in this case), we chose to impute the value as year sold - lease_commence_date:

In [7]:
# The count under True below is the number of NA values in the dataset
dfs['remaining_lease'].apply(lambda x: pd.isna(x)).value_counts()
Out[7]:
remaining_lease
False    212263
True      52203
Name: count, dtype: int64

We can fill the null values using year sold - lease_commence_date, reassigning the result back into remaining_lease as a string to stay consistent with the existing format for year-only dates. As months are not provided for lease_commence_date, we use the year sold only.

In [8]:
dfs['remaining_lease'] = dfs.apply(lambda row: (f"{datetime.datetime.strptime(row['month'], '%Y-%m').year - row['lease_commence_date']} years" 
                                              if pd.isna(row['remaining_lease']) else row['remaining_lease']), axis=1)
In [9]:
dfs['remaining_lease'].apply(lambda x: pd.isna(x)).value_counts()
Out[9]:
remaining_lease
False    264466
Name: count, dtype: int64

Another option would have been to leave the null values in place. We investigated and found this defensible, as the percentage of null values is under a threshold considered acceptable in the literature:

In [10]:
print(f'Percentage null values {(52203 / (212263 + 52203)) * 100:.2f}%')
Percentage null values 19.74%

Knowing that imputing null values with a column average may bias the dataset (e.g., the average price in Orchard would be unlikely to match the average in Tuas, had there been any residential lots there), we ruled out mean imputation.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8426774/
APA citation: Hyuk Lee, J., & Huber Jr., J. C. (2021). Evaluation of multiple imputation with large proportions of missing data: How much is too much? Iranian Journal of Public Health. https://doi.org/10.18502/ijph.v50i7.6626

The research applied multiple imputation (MI) to datasets with 20% and 80% missing values and found that higher proportions of missing data warrant a greater need for MI, because MI produced less biased estimates under all missingness mechanisms. Since under 20% is still within an acceptable threshold, the missing values in remaining_lease could defensibly have been kept.

However, in this analysis, year sold - lease_commence_date was available to approximate the remaining lease, so we filled the missing values instead of leaving them null.

In addition, the following were discovered:

  • resale_price values all end in ,000, so they can be expressed in thousands for simplification (i.e. resale_price/1000)
  • The block information is irrelevant to the analysis because we already have town to analyse on, so we can drop this column
  • street_name is irrelevant for the same reason as block, so we can drop this column as well
In [11]:
'''This part of the code takes care of the remaining_lease column and replaces it with a column called lease_left'''
pattern = re.compile(r"(?P<year>\d+)(?: (?P<month>\d+) months)?")

def lease_to_float(remaining_lease) -> list:
    """
    args:
        remaining_lease (str): form of "\d+ years \d+ months"
    returns:
        list of:
            str: year match
            str: month match if any, else 0
            float:
                float representation of year with
                month expressed as decimal hence sortable
    """
    if pd.isna(remaining_lease): 
        return None, None, None
    remaining_lease = str(remaining_lease)
    matches = pattern.findall(remaining_lease)
    matches = [i[0] for i in matches]
    if len(matches) < 2:
        matches.append(0)
    return matches + [float(matches[0])+(float(matches[1])/12)]

dfs = dfs.reset_index(drop=True)
# Apply the cleaning function to the data
dfs[['year', 'months', 'lease_left']] = pd.DataFrame(dfs['remaining_lease'].apply(lease_to_float).to_list(), index=dfs.index)

# Check if any data loss during the cleaning process
dfs[['lease_left', 'remaining_lease']].info()

# Create a column that is casted to datetime
dfs['month_dt'] = pd.to_datetime(dfs['month'], format='%Y-%m')

# drop the original column and save the cleaned data
dfs = dfs.drop(columns=['remaining_lease', 'month'])

dfs.to_csv('2012-3--2024-3.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264466 entries, 0 to 264465
Data columns (total 2 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   lease_left       264466 non-null  float64
 1   remaining_lease  264466 non-null  object 
dtypes: float64(1), object(1)
memory usage: 4.0+ MB
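As an aside, the same parsing can be done without a row-wise apply, using pandas' vectorized string extraction. The snippet below is a sketch of an equivalent approach on a toy series (not a change to the pipeline above):

```python
import pandas as pd

s = pd.Series(["78 years 07 months", "35 years", "94 years 01 month"])

# str.extract pulls named regex groups into columns; month is NaN when absent
parts = s.str.extract(r"(?P<year>\d+) years?(?: (?P<month>\d+) months?)?")
lease_left = parts["year"].astype(float) + parts["month"].fillna(0).astype(float) / 12

print(lease_left.round(3).tolist())
```

Vectorized extraction is typically faster on a dataset of this size than calling a Python function per row.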
In [12]:
# The code below drops the block and street_name columns and adds a rescaled resale price column
dfs['resale_price_thousand'] = dfs['resale_price'] / 1000

dfs = dfs.drop(columns=['block', 'street_name'])
In [13]:
# Check the current data format
dfs.sample(n=5)
Out[13]:
town flat_type storey_range floor_area_sqm flat_model lease_commence_date resale_price year months lease_left month_dt resale_price_thousand
138390 JURONG WEST 3 ROOM 07 TO 09 74.0 Model A 1984 280000.0 64 03 64.250000 2019-05-01 280.0
190211 SENGKANG EXECUTIVE 10 TO 12 130.0 Apartment 2001 644000.0 78 08 78.666667 2021-06-01 644.0
6476 JURONG WEST 5 ROOM 01 TO 05 134.0 Model A 1984 454000.0 28 0 28.000000 2012-05-01 454.0
19855 GEYLANG 3 ROOM 10 TO 12 59.0 Improved 1977 330000.0 35 0 35.000000 2012-12-01 330.0
79650 QUEENSTOWN 5 ROOM 10 TO 12 117.0 Improved 2011 888000.0 94 0 94.000000 2016-07-01 888.0
In [14]:
dfs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264466 entries, 0 to 264465
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   town                   264466 non-null  object        
 1   flat_type              264466 non-null  object        
 2   storey_range           264466 non-null  object        
 3   floor_area_sqm         264466 non-null  float64       
 4   flat_model             264466 non-null  object        
 5   lease_commence_date    264466 non-null  int64         
 6   resale_price           264466 non-null  float64       
 7   year                   264466 non-null  object        
 8   months                 264466 non-null  object        
 9   lease_left             264466 non-null  float64       
 10  month_dt               264466 non-null  datetime64[ns]
 11  resale_price_thousand  264466 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(1), object(6)
memory usage: 24.2+ MB
In [15]:
dfs.describe()
Out[15]:
floor_area_sqm lease_commence_date resale_price lease_left month_dt resale_price_thousand
count 264466.000000 264466.000000 2.644660e+05 264466.000000 264466 264466.000000
mean 96.992172 1993.971195 4.797057e+05 64.436153 2018-08-27 11:57:46.381312 479.705659
min 31.000000 1966.000000 1.000000e+05 1.000000 2012-03-01 00:00:00 100.000000
25% 76.000000 1984.000000 3.638880e+05 55.000000 2015-10-01 00:00:00 363.888000
50% 94.000000 1993.000000 4.500000e+05 68.000000 2019-01-01 00:00:00 450.000000
75% 112.000000 2003.000000 5.633750e+05 82.000000 2021-09-01 00:00:00 563.375000
max 280.000000 2022.000000 1.568888e+06 97.750000 2024-03-01 00:00:00 1568.888000
std 24.243036 13.202825 1.594134e+05 24.063751 NaN 159.413352

Task 4: Exploratory data analysis and visualization¶

Understanding the data¶

To understand the data, we start with summary statistics of the categorical variables and plot their distributions:

In [16]:
#summary statistics of categorical data 
print(dfs[['town','flat_type','flat_model']].describe())
            town flat_type flat_model
count     264466    264466     264466
unique        26         7         21
top     SENGKANG    4 ROOM    Model A
freq       20288    109186      86288
In [17]:
#plotting some visualisations
plt.figure(figsize=(20,5))
plt.subplot(1,3,1)
sns.countplot(x='town', data = dfs)
plt.xticks(rotation=90)
plt.title('Distribution of towns')

plt.figure(figsize=(20,5))
plt.subplot(1,3,1)
sns.countplot(x='flat_type', data = dfs)
plt.xticks(rotation=90)
plt.title('Distribution of flat types')

plt.figure(figsize=(20,5))
plt.subplot(1,3,1)
sns.countplot(x='flat_model', data = dfs)
plt.xticks(rotation=90)
plt.title('Distribution of flat models')

plt.tight_layout()
plt.show()
[Figures: Distribution of towns; Distribution of flat types; Distribution of flat models]
In [18]:
cross_tab = pd.crosstab(dfs['flat_type'], dfs['flat_model'])
print(cross_tab)
flat_model        2-room  3Gen  Adjoined flat  Apartment  DBSS  Improved  \
flat_type                                                                  
1 ROOM                 0     0              0          0     0       119   
2 ROOM                85     0              0          0     1       832   
3 ROOM                 0     0              0          0   430     17026   
4 ROOM                 0     0             50          0  1259      3113   
5 ROOM                 0    28            235          0  1555     45311   
EXECUTIVE              0     0            157      10122     0         0   
MULTI-GENERATION       0     0              0          0     0         0   

flat_model        Improved-Maisonette  Maisonette  Model A  \
flat_type                                                    
1 ROOM                              0           0        0   
2 ROOM                              0           0     2008   
3 ROOM                              0           0    16370   
4 ROOM                              0           0    64504   
5 ROOM                             35           0     3406   
EXECUTIVE                           0        7475        0   
MULTI-GENERATION                    0           0        0   

flat_model        Model A-Maisonette  ...  Multi Generation  New Generation  \
flat_type                             ...                                     
1 ROOM                             0  ...                 0               0   
2 ROOM                             0  ...                 0               0   
3 ROOM                             0  ...                 0           24406   
4 ROOM                             0  ...                 0           13312   
5 ROOM                           459  ...                 0               0   
EXECUTIVE                          0  ...                 0               0   
MULTI-GENERATION                   0  ...               103               0   

flat_model        Premium Apartment  Premium Apartment Loft  \
flat_type                                                     
1 ROOM                            0                       0   
2 ROOM                           57                       0   
3 ROOM                         1115                       0   
4 ROOM                        14342                      81   
5 ROOM                        10057                      26   
EXECUTIVE                      2342                       0   
MULTI-GENERATION                  0                       0   

flat_model        Premium Maisonette  Simplified  Standard  Terrace  Type S1  \
flat_type                                                                      
1 ROOM                             0           0         0        0        0   
2 ROOM                             0           0       989        0        0   
3 ROOM                             0        3122      4146      139        0   
4 ROOM                             0        8590       108       12      431   
5 ROOM                             0           0      2889        0        0   
EXECUTIVE                         20           0         0        0        0   
MULTI-GENERATION                   0           0         0        0        0   

flat_model        Type S2  
flat_type                  
1 ROOM                  0  
2 ROOM                  0  
3 ROOM                  0  
4 ROOM                  0  
5 ROOM                215  
EXECUTIVE               0  
MULTI-GENERATION        0  

[7 rows x 21 columns]
In [19]:
plt.figure(figsize=(20, 6))
sns.heatmap(cross_tab, annot=True, cmap='YlGnBu', fmt='d')
plt.title('Cross-tabulation of Flat Type and Flat Model')
plt.xlabel('Flat Model')
plt.ylabel('Flat Type')
plt.show()
[Figure: Cross-tabulation of Flat Type and Flat Model]
In [20]:
from scipy.stats import chi2_contingency
chi2, p, _, _ = chi2_contingency(cross_tab)
print(f"Chi-square statistic: {chi2}, p-value: {p}")
Chi-square statistic: 683545.8999649041, p-value: 0.0
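With roughly a quarter of a million rows, a near-zero p-value is almost guaranteed, so an effect-size measure such as Cramér's V is more informative about how strongly flat_type and flat_model are associated. The snippet below illustrates the computation on a small hypothetical contingency table (the numbers are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x3 contingency table (illustrative values only)
table = np.array([[120, 30, 5],
                  [40, 80, 25],
                  [10, 20, 70]])

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V: chi-square scaled by sample size and table dimensions, bounded in [0, 1]
n = table.sum()
k = min(table.shape) - 1
cramers_v = float(np.sqrt(chi2 / (n * k)))
print(round(cramers_v, 3))
```

Applying the same formula to `cross_tab` above would quantify the strength of the association rather than merely its statistical significance.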

Floor-size over time Trend Analysis:¶

  • Check the distribution of floor size and resale prices
  • Analyze the trend of flat sizes (floor area) resale price changes over time.
In [21]:
plt.figure(figsize=(120, 40))  # Adjust the figure size to accommodate both plots

# Plot for floor area
plt.subplot(1, 2, 1)  # 1 row, 2 columns, plot number 1
plt.hist(dfs['floor_area_sqm'], bins=50, edgecolor='black')  # Adjust the number of bins as needed
plt.title('Distribution of Floor Sizes', fontsize=100)
plt.xlabel('Floor Size (sqm)', fontsize=80)  # Add unit to the x-axis label
plt.ylabel('Frequency', fontsize=80)
plt.xticks(range(0, int(max(dfs['floor_area_sqm']))+10, 20))  # Set x-axis ticks with a step size of 20
plt.tick_params(axis='both', which='major', labelsize=80)  # Increase the size of the numbers on the axes

# Print a table with detailed statistics about floor area in the terminal
floor_area_stats = dfs['floor_area_sqm'].describe()
print("Floor Area Statistics:\n", floor_area_stats)

# Plot for resale price
plt.subplot(1, 2, 2)  # 1 row, 2 columns, plot number 2
plt.hist(dfs['resale_price_thousand'], bins=100, edgecolor='black')  # Adjust the number of bins as needed
plt.title('Distribution of Resale Prices', fontsize=100)
plt.xlabel('Resale Price (thousand $)', fontsize=80)  # Add unit to the x-axis label
plt.ylabel('Frequency', fontsize=80)
plt.xticks(range(0, int(max(dfs['resale_price_thousand']))+100, 100))  # Set x-axis ticks with a step size of 100
plt.tick_params(axis='both', which='major', labelsize=80)  # Increase the size of the numbers on the axes

# Print a table with detailed statistics about resale price in the terminal
resale_price_stats = dfs['resale_price_thousand'].describe()
print("Resale Price Statistics:\n", resale_price_stats)

plt.show()
Floor Area Statistics:
 count    264466.000000
mean         96.992172
std          24.243036
min          31.000000
25%          76.000000
50%          94.000000
75%         112.000000
max         280.000000
Name: floor_area_sqm, dtype: float64
Resale Price Statistics:
 count    264466.000000
mean        479.705659
std         159.413352
min         100.000000
25%         363.888000
50%         450.000000
75%         563.375000
max        1568.888000
Name: resale_price_thousand, dtype: float64
[Figures: Distribution of Floor Sizes; Distribution of Resale Prices]



Below, we continue plotting the price changes over time, separating flats into bins by floor area to observe how prices for different sizes changed over time:

In [22]:
dfs.sort_values('month_dt', inplace=True)
# Define bins for different ranges of sizes
bins = [0, 76, 94, 112, 280]  # Adjust these values as needed

# Create a new column for the size range
dfs['size_range'] = pd.cut(dfs['floor_area_sqm'], bins)

# Group by size range and month, then calculate the mean resale price
grouped = dfs.groupby(['size_range', 'month_dt'], observed=True)['resale_price_thousand'].mean().reset_index()

# Plot the mean resale price over time for each size range
for size_range in grouped['size_range'].unique():
    subset = grouped[grouped['size_range'] == size_range]
    plt.plot(subset['month_dt'], subset['resale_price_thousand'], label=str(size_range))

plt.title('Trend of Prices Over Time by Size Range')
plt.xlabel('Date')
plt.ylabel('Mean Price')
plt.legend(title='Size Range')
plt.show()
[Figure: Trend of Prices Over Time by Size Range]

From these plots, we can observe the following:

  • The real estate market has seen an overall increase in prices from 2012 to 2024, with larger flats experiencing a more substantial rise. This suggests a sustained demand for larger properties, or a market perception of greater value in larger units.
  • Investors might consider focusing on larger properties for potentially higher returns.

Town's effect on resale price over time¶

In [23]:
# Create a list of unique towns
towns = dfs['town'].unique()

# Create a scatter plot of resale prices over time for each town
plt.figure(figsize=(15, 10))
for town in towns:
    town_data = dfs[dfs['town'] == town]
    plt.scatter(town_data['month_dt'], town_data['resale_price_thousand'], label=town)

plt.title('Resale Price by Town Over Time')
plt.xlabel('Time')
plt.ylabel('Resale Price (thousand $)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')  # Place the legend outside the plot
plt.show()

# Grouping town data by month and taking the median to plot
town_medians = dfs.groupby(['town', 'month_dt'], as_index=False)[['resale_price_thousand']].median()
# Create a line plot of median resale prices over time for each town
plt.figure(figsize=(15, 10))
for town in towns:
    town_data = town_medians[town_medians['town'] == town]
    plt.plot(town_data['month_dt'], town_data['resale_price_thousand'], label=town)

plt.title('Median Resale Price by Town Over Time')
plt.xlabel('Time')
plt.ylabel('Median Resale Price (thousand $)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')  # Place the legend outside the plot
plt.show()
[Figures: Resale Price by Town Over Time; Median Resale Price by Town Over Time]

The following can be observed:

  • The provided visualizations reveal a market with increasing median housing prices over a 12-year span, marked by variability both within and across different towns.
  • The data suggests that certain areas may be experiencing greater price appreciation, possibly reflecting their desirability or other positive attributes.

Investigating flat_type's effect on resale price¶

In [24]:
# Get unique flat types
flat_types = dfs['flat_type'].unique()

# Create a scatter plot of resale prices over time for each flat type
plt.figure(figsize=(15, 10))
for flat_type in flat_types:
    flat_type_data = dfs[dfs['flat_type'] == flat_type]
    plt.scatter(flat_type_data['month_dt'], flat_type_data['resale_price_thousand'], label=flat_type)

plt.title('Resale Price by Flat Type Over Time')
plt.xlabel('Time')
plt.ylabel('Resale Price (thousand $)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')  # Place the legend outside the plot
plt.show()
[Figure: Resale Price by Flat Type Over Time]

From the above, we can observe:

  • There is a clear stratification of resale prices by flat type, with larger flats generally selling for more than smaller ones.
  • The data indicates a general increase in resale prices over the observed period for all flat types.
  • Despite the general increase, there is considerable variability within each flat type, implying that other factors also play a significant role in determining resale prices.

We can also view the distribution of flat_model and storey_range's in our dataset:

In [25]:
# Count and plot the number of occurrences of each unique value in 'flat_model'
flat_model_counts = dfs['flat_model'].value_counts()
flat_model_counts.plot(kind='bar', figsize=(10, 5))
plt.title('Counts of Flat Model')
plt.xlabel('Flat Model')
plt.ylabel('Count')
plt.show()

# Count and plot the number of occurrences of each unique value in 'story_range'
story_range_counts = dfs['storey_range'].value_counts()
story_range_counts.plot(kind='bar', figsize=(10, 5))
plt.title('Counts of Storey Range')
plt.xlabel('Storey Range')
plt.ylabel('Count')
plt.show()
[Figures: Counts of Flat Model; Counts of Storey Range]



For a general overview, the following is an interactive visualisation of the distribution of housing prices and how they change over time:

In [26]:
%%html
<div class='tableauPlaceholder' id='viz1710941321740' style='position: relative'><noscript><a href='#'><img alt='Dashboard 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Si&#47;SingaporeAverageResalePrice&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='SingaporeAverageResalePrice&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Si&#47;SingaporeAverageResalePrice&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-GB' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1710941321740');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='1027px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';vizElement.parentNode.insertBefore(scriptElement, vizElement);</script>
In [27]:
ax = sns.lineplot(dfs, x='month_dt', y='resale_price', errorbar=('ci', 95), estimator='median')
ax.set_title('Median resale price per month')
plt.show()
[Figure: Median resale price per month]
In [28]:
fig = make_subplots(rows=1, cols=2, subplot_titles=[
    'Distribution of resale price',
    'Distribution of floor_area_sqm'
])

trace0 = go.Histogram(
    x=dfs['resale_price'], 
    xbins={'size': 10000})

trace1 = go.Histogram(
    x=dfs['floor_area_sqm'],
    xbins={'size': 10}
)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)

fig.update_layout(
    autosize=False,
    width=1000,
    height=500,
)
# Define x axis labels
fig.update_xaxes(title_text='Bins')

# Define y axis labels
fig.update_yaxes(title_text="Counts")
fig.show()
In [29]:
def plot_categorical_variables(df: pd.DataFrame, col: str) -> None:
    """Plot the value counts for a categorical variable.

    Args:
        df (pd.DataFrame): DataFrame to load data from
        col (str): Name of the column to plot

    Returns:
        None
    """
    counts = df[col].value_counts()
    # Columnar text counts for reference
    print(f'Count of {col}')
    print(counts, end='')

    # Generate a bar chart of the counts
    counts = counts.reset_index()
    fig = px.bar(
        counts,
        x=col,
        y='count',
        title=f'Counts of {col}',
        labels={
            'count': f'Count of {col}'
        }
    )
    fig.show()
In [30]:
# Define the categorical variables and plot the counts for each
categorical_variables = [
    'flat_model',
    'flat_type',
    'town',
]

for variable in categorical_variables:
    plot_categorical_variables(dfs, variable)
Count of flat_model
flat_model
Model A                   86288
Improved                  66401
New Generation            37718
Premium Apartment         27913
Simplified                11712
Apartment                 10122
Standard                   8132
Maisonette                 7475
Model A2                   3384
DBSS                       3245
Model A-Maisonette          459
Adjoined flat               442
Type S1                     431
Type S2                     215
Terrace                     151
Premium Apartment Loft      107
Multi Generation            103
2-room                       85
Improved-Maisonette          35
3Gen                         28
Premium Maisonette           20
Name: count, dtype: int64
Count of flat_type
flat_type
4 ROOM              109186
3 ROOM               66754
5 ROOM               64216
EXECUTIVE            20116
2 ROOM                3972
1 ROOM                 119
MULTI-GENERATION       103
Name: count, dtype: int64
Count of town
town
SENGKANG           20288
WOODLANDS          19513
JURONG WEST        18834
TAMPINES           18159
YISHUN             17754
PUNGGOL            16022
BEDOK              15187
HOUGANG            13387
CHOA CHU KANG      11791
ANG MO KIO         11644
BUKIT MERAH        10212
BUKIT BATOK        10182
BUKIT PANJANG       9612
TOA PAYOH           8312
PASIR RIS           8198
KALLANG/WHAMPOA     7912
QUEENSTOWN          7296
SEMBAWANG           6985
GEYLANG             6851
CLEMENTI            6194
JURONG EAST         5655
SERANGOON           5168
BISHAN              4818
CENTRAL AREA        2148
MARINE PARADE       1713
BUKIT TIMAH          631
Name: count, dtype: int64

Task 5: Modeling and visualization¶

In [31]:
# With this, we get the median resale price per month for each flat_type in each town
grouped_town_flat = dfs.groupby(['town', 'month_dt', 'flat_type'], as_index=False)[['resale_price_thousand']].median()
In [32]:
# Then we sort by town, flat_type and, most importantly, price, and take the first and last
# entries, representing the lowest and highest monthly median price for each (town, flat_type) pair
grouped_town_flat_month_median = grouped_town_flat.sort_values(['town', 'flat_type', 'resale_price_thousand'])\
                                    .groupby(['town', 'flat_type'])\
                                        .agg(['first', 'last'])
grouped_town_flat_month_median
Out[32]:
month_dt resale_price_thousand
first last first last
town flat_type
ANG MO KIO 2 ROOM 2019-09-01 2023-02-01 178.0 310.944
3 ROOM 2019-05-01 2024-02-01 265.0 403.000
4 ROOM 2020-05-01 2023-01-01 380.0 735.000
5 ROOM 2020-04-01 2024-01-01 525.0 992.944
EXECUTIVE 2018-01-01 2023-09-01 660.0 1180.000
... ... ... ... ... ...
YISHUN 3 ROOM 2020-05-01 2024-02-01 250.0 409.888
4 ROOM 2018-02-01 2024-03-01 325.0 542.000
5 ROOM 2018-02-01 2023-12-01 410.0 678.000
EXECUTIVE 2017-10-01 2022-10-01 499.0 875.400
MULTI-GENERATION 2016-09-01 2023-01-01 650.0 1080.000

131 rows × 4 columns

In [33]:
# Flatten the index and rename the columns
grouped_town_flat_month_median.reset_index(inplace=True)
grouped_town_flat_month_median.columns = ['town', 
                                          'flat_type', 
                                          'month_min', 
                                          'month_max', 
                                          'resale_price_thousand_min', 
                                          'resale_price_thousand_max']
grouped_town_flat_month_median
Out[33]:
town flat_type month_min month_max resale_price_thousand_min resale_price_thousand_max
0 ANG MO KIO 2 ROOM 2019-09-01 2023-02-01 178.0 310.944
1 ANG MO KIO 3 ROOM 2019-05-01 2024-02-01 265.0 403.000
2 ANG MO KIO 4 ROOM 2020-05-01 2023-01-01 380.0 735.000
3 ANG MO KIO 5 ROOM 2020-04-01 2024-01-01 525.0 992.944
4 ANG MO KIO EXECUTIVE 2018-01-01 2023-09-01 660.0 1180.000
... ... ... ... ... ... ...
126 YISHUN 3 ROOM 2020-05-01 2024-02-01 250.0 409.888
127 YISHUN 4 ROOM 2018-02-01 2024-03-01 325.0 542.000
128 YISHUN 5 ROOM 2018-02-01 2023-12-01 410.0 678.000
129 YISHUN EXECUTIVE 2017-10-01 2022-10-01 499.0 875.400
130 YISHUN MULTI-GENERATION 2016-09-01 2023-01-01 650.0 1080.000

131 rows × 6 columns

From this table, we can calculate the return on investment (ROI), defined as:

$\text{ROI} = \frac{\text{Final value of investment} - \text{Initial value of investment}}{\text{Cost of investment}} \times 100\%$

Since we are looking for the highest hypothetical return, i.e. the case where someone could buy at the lowest median price and sell at the highest, this becomes:

$\text{ROI} = \frac{\text{resale\_price\_thousand\_max} - \text{resale\_price\_thousand\_min}}{\text{resale\_price\_thousand\_min}} \times 100\%$
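As a quick sanity check, the formula can be sketched as a small helper (the numbers below are the BUKIT MERAH 2 ROOM medians, in thousands of SGD):

```python
def roi_percent(price_min: float, price_max: float) -> float:
    """Hypothetical buy-low/sell-high return on investment, in percent."""
    return (price_max - price_min) / price_min * 100

# A median price of 170 (thousand SGD) rising to 523 gives ~207.6% ROI
print(round(roi_percent(170, 523), 1))  # 207.6
```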

In [34]:
# Calculating ROI
grouped_town_flat_month_median['median_roi'] = ((grouped_town_flat_month_median['resale_price_thousand_max'] - \
                                            grouped_town_flat_month_median['resale_price_thousand_min']) / \
                                            grouped_town_flat_month_median['resale_price_thousand_min']) * 100
In [35]:
top_50 = grouped_town_flat_month_median.sort_values(by=['median_roi'], ascending=False).head(50)
top_50
Out[35]:
town flat_type month_min month_max resale_price_thousand_min resale_price_thousand_max median_roi
21 BUKIT MERAH 2 ROOM 2020-05-01 2024-02-01 170.000 523.000 207.647059
37 CENTRAL AREA 4 ROOM 2014-08-01 2023-08-01 420.000 1200.000 185.714286
80 PASIR RIS 3 ROOM 2019-10-01 2023-12-01 215.000 610.000 183.720930
38 CENTRAL AREA 5 ROOM 2012-03-01 2023-10-01 575.500 1438.888 150.023979
45 CLEMENTI 2 ROOM 2019-06-01 2023-05-01 205.000 492.000 140.000000
89 QUEENSTOWN 2 ROOM 2018-11-01 2023-12-01 185.000 431.500 133.243243
48 CLEMENTI 5 ROOM 2018-01-01 2024-01-01 526.000 1180.000 124.334601
35 CENTRAL AREA 2 ROOM 2020-06-01 2023-06-01 200.000 430.000 115.000000
47 CLEMENTI 4 ROOM 2019-04-01 2024-02-01 418.000 870.000 108.133971
52 GEYLANG 4 ROOM 2014-08-01 2023-10-01 377.000 770.000 104.244032
64 JURONG EAST EXECUTIVE 2017-01-01 2023-09-01 500.000 1020.000 104.000000
117 TOA PAYOH 4 ROOM 2015-04-01 2022-10-01 415.000 844.444 103.480482
10 BISHAN 3 ROOM 2016-01-01 2023-08-01 260.000 520.000 100.000000
5 BEDOK 2 ROOM 2019-12-01 2023-07-01 180.000 355.000 97.222222
120 WOODLANDS 2 ROOM 2018-11-01 2024-02-01 180.000 355.000 97.222222
2 ANG MO KIO 4 ROOM 2020-05-01 2023-01-01 380.000 735.000 93.421053
121 WOODLANDS 3 ROOM 2020-04-01 2024-02-01 223.000 430.000 92.825112
50 GEYLANG 2 ROOM 2019-10-01 2023-10-01 150.000 287.944 91.962667
107 SERANGOON 5 ROOM 2016-03-01 2024-01-01 455.888 871.000 91.055698
22 BUKIT MERAH 3 ROOM 2020-04-01 2022-02-01 283.500 540.000 90.476190
3 ANG MO KIO 5 ROOM 2020-04-01 2024-01-01 525.000 992.944 89.132190
65 JURONG WEST 2 ROOM 2018-12-01 2023-03-01 192.000 360.000 87.500000
118 TOA PAYOH 5 ROOM 2020-02-01 2024-03-01 550.000 1026.944 86.717091
9 BEDOK EXECUTIVE 2019-06-01 2024-03-01 590.000 1100.000 86.440678
30 BUKIT PANJANG EXECUTIVE 2016-07-01 2023-03-01 450.000 838.000 86.222222
119 TOA PAYOH EXECUTIVE 2019-07-01 2024-02-01 600.000 1114.444 85.740667
73 KALLANG/WHAMPOA 5 ROOM 2020-05-01 2023-08-01 523.500 970.000 85.291309
18 BUKIT BATOK 5 ROOM 2019-11-01 2022-06-01 470.000 865.000 84.042553
53 GEYLANG 5 ROOM 2020-01-01 2018-07-01 500.000 920.000 84.000000
71 KALLANG/WHAMPOA 3 ROOM 2020-04-01 2024-03-01 244.000 448.000 83.606557
55 HOUGANG 2 ROOM 2019-08-01 2024-02-01 205.000 375.000 82.926829
17 BUKIT BATOK 4 ROOM 2020-04-01 2023-06-01 332.000 603.500 81.777108
105 SERANGOON 3 ROOM 2019-09-01 2023-09-01 255.000 463.000 81.568627
34 BUKIT TIMAH EXECUTIVE 2012-03-01 2023-12-01 745.000 1350.000 81.208054
32 BUKIT TIMAH 4 ROOM 2012-07-01 2023-11-01 495.000 890.000 79.797980
4 ANG MO KIO EXECUTIVE 2018-01-01 2023-09-01 660.000 1180.000 78.787879
108 SERANGOON EXECUTIVE 2015-02-01 2022-10-01 607.000 1085.000 78.747941
19 BUKIT BATOK EXECUTIVE 2015-01-01 2023-05-01 535.000 954.000 78.317757
96 SEMBAWANG 4 ROOM 2020-04-01 2023-11-01 320.000 570.000 78.125000
97 SEMBAWANG 5 ROOM 2018-03-01 2023-12-01 370.000 650.000 75.675676
129 YISHUN EXECUTIVE 2017-10-01 2022-10-01 499.000 875.400 75.430862
98 SEMBAWANG EXECUTIVE 2018-03-01 2023-07-01 448.888 786.000 75.099357
74 KALLANG/WHAMPOA EXECUTIVE 2019-11-01 2023-03-01 655.000 1146.000 74.961832
72 KALLANG/WHAMPOA 4 ROOM 2014-10-01 2022-02-01 480.000 839.000 74.791667
0 ANG MO KIO 2 ROOM 2019-09-01 2023-02-01 178.000 310.944 74.687640
87 PUNGGOL 5 ROOM 2018-06-01 2024-01-01 424.000 740.000 74.528302
114 TAMPINES MULTI-GENERATION 2012-06-01 2024-03-01 610.000 1060.000 73.770492
124 WOODLANDS EXECUTIVE 2016-09-01 2024-03-01 530.000 915.000 72.641509
41 CHOA CHU KANG 3 ROOM 2017-03-01 2024-03-01 250.000 430.000 72.000000
13 BISHAN EXECUTIVE 2013-09-01 2024-02-01 755.000 1295.000 71.523179
In [36]:
print(top_50['flat_type'].value_counts())
print()
print(top_50['town'].value_counts())
flat_type
EXECUTIVE           13
2 ROOM              10
5 ROOM              10
4 ROOM               9
3 ROOM               7
MULTI-GENERATION     1
Name: count, dtype: int64

town
ANG MO KIO         4
KALLANG/WHAMPOA    4
SERANGOON          3
SEMBAWANG          3
CLEMENTI           3
GEYLANG            3
BUKIT BATOK        3
TOA PAYOH          3
CENTRAL AREA       3
WOODLANDS          3
BUKIT TIMAH        2
BUKIT MERAH        2
BEDOK              2
BISHAN             2
JURONG WEST        1
BUKIT PANJANG      1
JURONG EAST        1
HOUGANG            1
QUEENSTOWN         1
PASIR RIS          1
YISHUN             1
PUNGGOL            1
TAMPINES           1
CHOA CHU KANG      1
Name: count, dtype: int64

From the above, we can observe within the top 50 town/flat-type combinations ranked by ROI of median resale price:

  • ANG MO KIO and KALLANG/WHAMPOA stand out as towns for investing, each contributing four flat_types with a high ROI
  • EXECUTIVE flats are the most represented flat_type, appearing 13 times in the top 50, i.e. for 13 of the 26 towns
In [37]:
import datetime
import numpy as np

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_selector
from sklearn.model_selection import train_test_split
# HistGradientBoostingRegressor was found to yield a better test R² than GradientBoostingRegressor
from sklearn.ensemble import HistGradientBoostingRegressor 
In [38]:
from sklearn.linear_model import LinearRegression
In [39]:
dfs['month_timestamp'] = dfs['month_dt'].apply(lambda x: x.timestamp())
In [40]:
train_cols = [
    'month_timestamp',
    'town',
    'flat_type',
    'storey_range',
    'floor_area_sqm',
    'flat_model',
    'lease_commence_date',
    'lease_left'
]
In [41]:
X_train, X_test, y_train, y_test = train_test_split(
    dfs[train_cols], dfs["resale_price"], test_size=0.25
)
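Note that the split above is shuffled anew on every run; for reproducible results one would typically fix random_state. A minimal sketch on toy data (not the notebook's dfs):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'x': range(8), 'y': range(8)})
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[['x']], toy['y'], test_size=0.25, random_state=42
)
# test_size=0.25 on 8 rows puts 2 rows in the test set and 6 in the train set
print(len(X_tr), len(X_te))
```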
In [42]:
dfs['flat_type'].drop_duplicates().sort_values().to_list()
Out[42]:
['1 ROOM',
 '2 ROOM',
 '3 ROOM',
 '4 ROOM',
 '5 ROOM',
 'EXECUTIVE',
 'MULTI-GENERATION']
In [43]:
dfs['storey_range'].drop_duplicates().sort_values().to_list()
Out[43]:
['01 TO 03',
 '01 TO 05',
 '04 TO 06',
 '06 TO 10',
 '07 TO 09',
 '10 TO 12',
 '11 TO 15',
 '13 TO 15',
 '16 TO 18',
 '16 TO 20',
 '19 TO 21',
 '21 TO 25',
 '22 TO 24',
 '25 TO 27',
 '26 TO 30',
 '28 TO 30',
 '31 TO 33',
 '31 TO 35',
 '34 TO 36',
 '36 TO 40',
 '37 TO 39',
 '40 TO 42',
 '43 TO 45',
 '46 TO 48',
 '49 TO 51']
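The storey_range labels are zero-padded, so plain lexicographic sorting happens to follow the numeric floor order; this is why an OrdinalEncoder, which assigns integer codes by sorted category order, is a reasonable choice here. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

storeys = np.array([['10 TO 12'], ['01 TO 03'], ['07 TO 09'], ['04 TO 06']])
enc = OrdinalEncoder()
codes = enc.fit_transform(storeys).ravel()
# categories_ sort lexicographically: '01 TO 03' < '04 TO 06' < '07 TO 09' < '10 TO 12',
# so the codes preserve the storey ordering
print(codes.tolist())  # [3.0, 0.0, 2.0, 1.0]
```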
In [44]:
ct = make_column_transformer(
    (OneHotEncoder(), make_column_selector('town|flat_model|flat_type')),
    (OrdinalEncoder(), make_column_selector('storey_range')),
    remainder=StandardScaler()
)
In [45]:
ct.fit(X_train, y=y_train)
Out[45]:
ColumnTransformer(remainder=StandardScaler(),
                  transformers=[('onehotencoder', OneHotEncoder(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x169be2a50>),
                                ('ordinalencoder', OrdinalEncoder(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x168ec50d0>)])
In [46]:
model = Pipeline([('Transform', ct), 
                     ('todense', FunctionTransformer(lambda x: np.array(x.todense()), accept_sparse=True)),
                     ('hgbr', HistGradientBoostingRegressor(loss='squared_error'))])
In [47]:
model.fit(X_train, y=y_train)
Out[47]:
Pipeline(steps=[('Transform',
                 ColumnTransformer(remainder=StandardScaler(),
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x169be2a50>),
                                                 ('ordinalencoder',
                                                  OrdinalEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x168ec50d0>)])),
                ('todense',
                 FunctionTransformer(accept_sparse=True,
                                     func=<function <lambda> at 0x169f1d3a0>)),
                ('hgbr', HistGradientBoostingRegressor())])

Now we evaluate the model by calling .score(X, y) on the held-out test set:
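For a regressor, .score returns the coefficient of determination $R^2$, not a classification accuracy; it is the same value r2_score gives on the model's predictions. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1
reg = LinearRegression().fit(X, y)
# R^2 = 1 - SS_res / SS_tot; a perfect linear fit gives R^2 of (numerically) 1
print(np.isclose(reg.score(X, y), r2_score(y, reg.predict(X))))  # True
```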

In [48]:
print(f'The model R² score on the test set is:\n{model.score(X_test, y_test)}')
The model R² score on the test set is:
0.9376782214799758
In [49]:
set(dfs['flat_model'])
Out[49]:
{'2-room',
 '3Gen',
 'Adjoined flat',
 'Apartment',
 'DBSS',
 'Improved',
 'Improved-Maisonette',
 'Maisonette',
 'Model A',
 'Model A-Maisonette',
 'Model A2',
 'Multi Generation',
 'New Generation',
 'Premium Apartment',
 'Premium Apartment Loft',
 'Premium Maisonette',
 'Simplified',
 'Standard',
 'Terrace',
 'Type S1',
 'Type S2'}
In [50]:
print('This timestamp will be used as time now', timenow:= datetime.datetime.now().timestamp())
This timestamp will be used as time now 1712385087.839204
In [51]:
prediction = model.predict(pd.DataFrame(
    {
    'month_timestamp': timenow,
    'town': 'CLEMENTI',
    'flat_type': '3 ROOM',
    'storey_range': '01 TO 05',
    'floor_area_sqm': 121.0,
    'flat_model': 'Standard',
    'lease_commence_date': 2012,
    'lease_left': 87.8
}, 
index=[0]
))
In [52]:
print(f'The predicted resale price for the given parameters is: ${round(prediction[0])} SGD')
The predicted resale price for the given parameters is: $862447 SGD

We can also fit a linear regression for comparison:

In [53]:
ct = make_column_transformer(
    (OneHotEncoder(), make_column_selector('town|flat_model|flat_type')),
    (OrdinalEncoder(), make_column_selector('storey_range')),
    (StandardScaler(), ["month_timestamp",
                        "floor_area_sqm",
                        "lease_commence_date",
                        "lease_left"]),
    remainder='drop'
)
In [54]:
linear_model = Pipeline([('Transform', ct), 
                     ('linear', LinearRegression())])
In [55]:
linear_model.fit(X_train, y=y_train)
Out[55]:
Pipeline(steps=[('Transform',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x16463c390>),
                                                 ('ordinalencoder',
                                                  OrdinalEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x169a9fdd0>),
                                                 ('standardscaler',
                                                  StandardScaler(),
                                                  ['month_timestamp',
                                                   'floor_area_sqm',
                                                   'lease_commence_date',
                                                   'lease_left'])])),
                ('linear', LinearRegression())])
In [57]:
print(f'The model R² score on the test set is:\n{linear_model.score(X_test, y_test)}')
The model R² score on the test set is:
0.8259606271192403
In [58]:
print(f"Intercept: {linear_model.named_steps.linear.intercept_}")
for name, coef in zip(ct.get_feature_names_out(), linear_model.named_steps.linear.coef_):
    print(name.split('__')[1], coef)
Intercept: 554692.0109006135
town_ANG MO KIO 24254.649060333482
town_BEDOK 6285.5578124036965
town_BISHAN 107493.43656027
town_BUKIT BATOK -52033.90216840438
town_BUKIT MERAH 124487.78857688299
town_BUKIT PANJANG -108912.09797593419
town_BUKIT TIMAH 208899.3963694276
town_CENTRAL AREA 119979.05240283573
town_CHOA CHU KANG -138535.46465697858
town_CLEMENTI 57155.645152225035
town_GEYLANG 54956.99860410129
town_HOUGANG -51990.485583400325
town_JURONG EAST -44584.60419456006
town_JURONG WEST -103011.0544715051
town_KALLANG/WHAMPOA 77308.51507036905
town_MARINE PARADE 164440.2321220538
town_PASIR RIS -78496.1823644678
town_PUNGGOL -99103.4860887537
town_QUEENSTOWN 131764.6653719128
town_SEMBAWANG -144580.1834382062
town_SENGKANG -112890.58404433684
town_SERANGOON 19293.8027585792
town_TAMPINES -23784.737846081494
town_TOA PAYOH 71039.02553369908
town_WOODLANDS -128516.07913615696
town_YISHUN -80919.9034263563
flat_type_1 ROOM -50678.311124033
flat_type_2 ROOM -38169.80026639458
flat_type_3 ROOM -5358.021102474688
flat_type_4 ROOM 9878.288097447414
flat_type_5 ROOM 24790.793485242524
flat_type_EXECUTIVE 26596.558758818494
flat_type_MULTI-GENERATION 32940.49214962674
flat_model_2-room -92325.96223161928
flat_model_3Gen -40036.12442199111
flat_model_Adjoined flat -12488.316993802195
flat_model_Apartment -47976.7662999336
flat_model_DBSS 36374.77667489754
flat_model_Improved -89046.35567500415
flat_model_Improved-Maisonette 101956.35560678365
flat_model_Maisonette -8803.191171599225
flat_model_Model A -90231.41464190934
flat_model_Model A-Maisonette 28941.996270884025
flat_model_Model A2 -91559.21371984224
flat_model_Multi Generation 32940.49214962674
flat_model_New Generation -70315.55041071167
flat_model_Premium Apartment -79423.66289840813
flat_model_Premium Apartment Loft 45642.70238398403
flat_model_Premium Maisonette -20068.246105312217
flat_model_Simplified -71393.82524228723
flat_model_Standard -69352.56526723651
flat_model_Terrace 261451.3710296846
flat_model_Type S1 106254.69828758152
flat_model_Type S2 169458.8026723239
storey_range 8230.987040352797
month_timestamp 38417.67012072079
floor_area_sqm 88451.65478397888
lease_commence_date 78938.95389395114
lease_left -33963.788146583516
In [59]:
l_prediction = linear_model.predict(pd.DataFrame(
    {
    'month_timestamp': timenow,
    'town': 'CLEMENTI',
    'flat_type': '3 ROOM',
    'storey_range': '01 TO 05',
    'floor_area_sqm': 121.0,
    'flat_model': 'Standard',
    'lease_commence_date': 2012,
    'lease_left': 87.8
}, 
index=[0]
))
In [60]:
l_prediction[0]
Out[60]:
768602.2504917412

Task 6: Report insights and conclusions¶

In the report, describe all the tasks above. You can describe each task at relevant places in the
notebook rather than writing a long essay at the beginning/in the end. In your reports, have a
section that explicitly presents your insights into the problem being solved and the conclusions
you draw.